This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
require("jsonlite")
## Loading required package: jsonlite
##
## Attaching package: 'jsonlite'
##
## The following object is masked from 'package:utils':
##
## View
require("RCurl")
## Loading required package: RCurl
## Loading required package: bitops
require("dplyr")
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
require("ggplot2")
## Loading required package: ggplot2
require("knitr")
## Loading required package: knitr
require("dplyr")
require("extrafont")
## Loading required package: extrafont
## Registering fonts with R
CREATE TABLE Alaska_OilWell (
– Change table_name to the table name you want.
api_number varchar2(4000),
operator_name varchar2(4000),
well_name varchar2(4000),
status varchar2(4000),
parent varchar2(4000),
cik varchar2(4000),
serialid varchar2(4000),
latitude number(38,4),
longitude number(38,4)
);
CREATE TABLE Baltimore_Salary (
– Change table_name to the table name you want.
name varchar2(4000),
jobtitle varchar2(4000),
agencyid varchar2(4000),
agency varchar2(4000),
hiredate varchar2(4000),
serialid varchar2(4000),
annualsalary number(38,4),
grosspay number(38,4)
);
CREATE TABLE Texas_Salaries (
– Change table_name to the table name you want.
name varchar2(4000),
jobtitle varchar2(4000),
agencyid varchar2(4000),
agency varchar2(4000),
hiredate varchar2(4000),
annualsalary varchar2(4000),
grosspay varchar2(4000),
annual_salary number(38,4),
serialid number(38,4)
);
This is actually not only a Scatterplot between “Longitude” and “Latitude”; this is more of a plot/ map demonstrating the geographic locations/ distributions of different oil companies’ oil wells. From this scatterplot, we can see that ARCO ALASKA INC and EXXON CORP are managing the majority of oil wells operating on sea. The Northest oil well belongs to SHELL WESTERN INC with Longitude and Latitude of approximately (-165.81, 71.85). The Southernest oil well is operated by CITY OF AKUTAN with Longitude and Latitude of (-165.85, 54.15). The most important information we can get from this scatterplot is that we can clearly tell, from the distributions of oil wells that the concentration of oil and gas is mainly at the north and south borders of Alaska State.
From this scatterplot, we can tell that except Police Department, Fire Department, Waste Water, Law Department are clustering and do not have a clear relationship between Annual Salary and Gross Salary, other roles in the following agencies, Circuit Court, City Council, Community Relations, Company Audits, Company Controllers/ Communications, DFW-Administration etc. all share a high positive correlation coefficient and can be accurately modeled with a linear model.
We find that if $40,000 and $67,510 are the parameters that determine “Low”, “Medium”, “High”, then very few poeple’s salaries fall in the range over $67,510, and the majority of salaries of Baltimore employees are categorized as “Low” and “Medium”.
If we adopt the same KPI parameters and we only look at SUPERVISOR roles, we find that the majority of such roles make salaries in the high range.
From this bar chart, we see that the average number of companies that are in each corporation is about 550. BP PLC is one of the corporations that has the most number of companies, which is about 4300. However, the rest of these corporations are below the reference line.
source("../03 R SQL Visualizations/Alaska_OilWell_Scatterplot.R", echo = TRUE)
##
## > require("jsonlite")
##
## > require("RCurl")
##
## > df <- data.frame(fromJSON(getURL(URLencode("skipper.cs.utexas.edu:5001/rest/native/?query=\"select * from alaska_oilwell\""),
## + httpheader = c( .... [TRUNCATED]
##
## > ggplot(data = df, aes(x = LONGITUDE, y = LATITUDE,
## + color = OPERATOR_NAME), width = 300, height = 300) + geom_point() +
## + theme(axis.tick .... [TRUNCATED]
Similar to how we did in tableau, this scatterplot shows all the oil companies are in America, and let us know which are the major companies among all. Also, I added the theme function inside ggplot to hide the values along the x and y axis because there are so many companies have been included in this dataset, and it’s better to make a plot without showing the axis.
source("../03 R SQL Visualizations/Baltimore_Salaries_Crosstab.R", echo = TRUE)
##
## > KPI_Low_Max_value = 40000
##
## > KPI_Medium_Max_value = 80000
##
## > BALTIMORE_SALARY <- data.frame(fromJSON(getURL(URLencode("skipper.cs.utexas.edu:5001/rest/native/?query=\"select * from BALTIMORE_SALARY\""),
## + .... [TRUNCATED]
##
## > df <- BALTIMORE_SALARY %>% group_by(AGENCY, JOBTITLE) %>%
## + mutate(average = mean(ANNUALSALARY)) %>% mutate(KPI = ifelse(average <=
## + KPI_ .... [TRUNCATED]
##
## > ggplot() + coord_cartesian() + scale_x_discrete() +
## + scale_y_discrete() + labs(title = "Baltimore Salaries Crosstab\n") +
## + labs(x = past .... [TRUNCATED]
In this plot, we created 2 variables, one was the KPI Low Max value and the other was a KPI Medium Max value. The purpose for creating these 2 variables because It’s easier to see which job position earns higher salary. The value that I set for KPI Low Max was 40,000 and was 80,000 for KPI Medium Max. Any value higher then 80,000 will be considered as a high. By looking at this plot, it’s really hard to see the crosstab, because of the text that appears on every block. And The text represents the average salary of each job position.
source("../03 R SQL Visualizations/Texas_Salaries_Barchart.R", echo = TRUE)
##
## > require("jsonlite")
##
## > require("RCurl")
##
## > TEXAS_SALARIES <- data.frame(fromJSON(getURL(URLencode("skipper.cs.utexas.edu:5001/rest/native/?query=\"select * from TEXAS_SALARIES\""),
## + htt .... [TRUNCATED]
##
## > df <- TEXAS_SALARIES %>% group_by(DEPARTMENT) %>%
## + mutate(AVERAGE = mean(ANNUAL_SALARY)) %>% mutate(REF_LINE = mean(AVERAGE))
##
## > ggplot() + coord_cartesian() + coord_flip() + facet_wrap("DEPARTMENT",
## + ncol = 1) + labs(x = paste("TITLE"), y = paste("ANNUAL_SALARY")) +
## + .... [TRUNCATED]
In this plot, we try to discover the average annual salary in each department. From the plot, we see that the jobs in Texas Department of Criminal Justice seems to get paid higher than the other departments. However, its average isn’t as high as jobs are in Texas Department of Transportation. Also, there are more jobs offered in Texas Department of Criminal Justice than the others. The red reference lines represents the average of annual salary in each Department.
source("../03 R SQL Visualizations/Blending_Texas&Baltimore_Hiredate.R", echo = TRUE)
##
## > require("jsonlite")
##
## > require("RCurl")
##
## > require("dplyr")
##
## > BALTIMORE_SALARY <- data.frame(fromJSON(getURL(URLencode("skipper.cs.utexas.edu:5001/rest/native/?query=\"select * from BALTIMORE_SALARY\""),
## + .... [TRUNCATED]
##
## > TEXAS_SALARY <- data.frame(fromJSON(getURL(URLencode("skipper.cs.utexas.edu:5001/rest/native/?query=\"select * from TEXAS_SALARIES\""),
## + httph .... [TRUNCATED]
##
## > df <- data.frame(dplyr::left_join(BALTIMORE_SALARY,
## + TEXAS_SALARY, by = "SERIALID"))
##
## > head(df)
## NAME.x JOBTITLE.x AGENCYID.x
## 1 Aaron,Patricia G Facilities/Office Services II A03031
## 2 Aaron,Petra L ASSISTANT STATES ATTORNEY A29005
## 3 Abaineh,Yohannes T EPIDEMIOLOGIST A65026
## 4 Abdal-Rahim,Naim A EMT Firefighter A64063
## 5 Abdi,Ezekiel W POLICE OFFICER A99398
## 6 Abdul Hamid,Umar LIQUOR BOARD INSPECTOR II A09001
## AGENCY.x HIREDATE.x SERIALID ANNUALSALARY.x GROSSPAY.x
## 1 OED-Employment Dev 10/24/79 1 50845 45505.94
## 2 States Attorneys Office 9/25/06 2 56595 51588.89
## 3 HLTH-Health Department 7/23/09 3 56500 50633.26
## 4 Fire Academy Recruits 3/30/11 4 33476 3888.95
## 5 Police Department 6/14/07 5 50919 51421.73
## 6 Liquor License Board 1/17/95 6 43278 39116.93
## NAME.y JOBTITLE.y AGENCYID.y
## 1 Aaron,Patricia G Facilities/Office Services II A03031
## 2 Aaron,Petra L ASSISTANT STATES ATTORNEY A29005
## 3 Abaineh,Yohannes T EPIDEMIOLOGIST A65026
## 4 Abdal-Rahim,Naim A EMT Firefighter A64063
## 5 Abdi,Ezekiel W POLICE OFFICER A99398
## 6 Abdul Hamid,Umar LIQUOR BOARD INSPECTOR II A09001
## AGENCY.y HIREDATE.y ANNUALSALARY.y GROSSPAY.y
## 1 OED-Employment Dev 10/24/79 50845 45505.94
## 2 States Attorneys Office 9/25/06 56595 51588.89
## 3 HLTH-Health Department 7/23/09 56500 50633.26
## 4 Fire Academy Recruits 3/30/11 33476 3888.95
## 5 Police Department 6/14/07 50919 51421.73
## 6 Liquor License Board 1/17/95 43278 39116.93
## ANNUAL_SALARY
## 1 null
## 2 null
## 3 null
## 4 null
## 5 null
## 6 null
Here I blend the data using the shared SERIALID. There will be two people’s salarty information, one from Baltimore and another from Texas, on each row of one SERIALID number. And I called head of the dateframe for a snapshot of the blended dataset.
Here I blend the BaltimoreSalary and TexasSalary through the Hire Date dimension. And compare the salaries of these two states year by year. Apparently, each year the sum of Baltimore Salary is higher than the sum of Texas Salary with no exception. The difference is highest in 2006 and lowest in 1991 and 1998. Also the trend for the salaries of both states is that the sums are steadily increasing with fluctuations between 1981 and 2001.